The goal of this project is to analyze chocolate sales data from the year 2022 to extract meaningful insights and trends. This includes validating and cleaning the dataset, performing univariate, bivariate, and multivariate statistical analysis, and visualizing key patterns. By understanding factors like top-performing salespersons, most popular products, seasonal sales trends, and country-wise performance, the project aims to support data-driven decision-making in marketing, supply chain, and product strategy for chocolate sales.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
raw=pd.read_csv("Chocolate Sales @ Kg\Data.csv")
raw
| Sales Person | Country | Product | Date | Amount | Boxes Shipped | |
|---|---|---|---|---|---|---|
| 0 | Jehu Rudeforth | UK | Mint Chip Choco | 04-Jan-22 | $5,320 | 180 |
| 1 | Van Tuxwell | India | 85% Dark Bars | 01-Aug-22 | $7,896 | 94 |
| 2 | Gigi Bohling | India | Peanut Butter Cubes | 07-Jul-22 | $4,501 | 91 |
| 3 | Jan Morforth | Australia | Peanut Butter Cubes | 27-Apr-22 | $12,726 | 342 |
| 4 | Jehu Rudeforth | UK | Peanut Butter Cubes | 24-Feb-22 | $13,685 | 184 |
| ... | ... | ... | ... | ... | ... | ... |
| 1089 | Karlen McCaffrey | Australia | Spicy Special Slims | 17-May-22 | $4,410 | 323 |
| 1090 | Jehu Rudeforth | USA | White Choc | 07-Jun-22 | $6,559 | 119 |
| 1091 | Ches Bonnell | Canada | Organic Choco Syrup | 26-Jul-22 | $574 | 217 |
| 1092 | Dotty Strutley | India | Eclairs | 28-Jul-22 | $2,086 | 384 |
| 1093 | Karlen McCaffrey | India | 70% Dark Bites | 23-May-22 | $5,075 | 344 |
1094 rows × 6 columns
column info
| column | discription |
|---|---|
| Sales Person | Name of the salesperson responsible for the transaction. |
| Country | Sales region or store location where the transaction took place. |
| Product | Name of the chocolate product sold. |
| Date | The transaction date of the chocolate sale. |
| Amount | Total revenue generated from the sale. |
| Boxes Shipped | Number of chocolate boxes shipped in the order. |
df=raw.copy()
Basic Checks
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1094 entries, 0 to 1093 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Sales Person 1094 non-null object 1 Country 1094 non-null object 2 Product 1094 non-null object 3 Date 1094 non-null object 4 Amount 1094 non-null object 5 Boxes Shipped 1094 non-null int64 dtypes: int64(1), object(5) memory usage: 51.4+ KB
# df.Amount=df.Amount.str.strip('$').str.replace(',','').str.strip()
# df['Amount']=df.Amount.astype('int')
# or
df['Amount'] = df['Amount'].replace('[\$,]', '', regex=True).astype(int)
df.rename({'Amount':'Amount in thousand ($)'},axis=1,inplace=True)
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y', errors='coerce')
df.isna().sum()
Sales Person 0 Country 0 Product 0 Date 0 Amount in thousand ($) 0 Boxes Shipped 0 dtype: int64
df
| Sales Person | Country | Product | Date | Amount in thousand ($) | Boxes Shipped | |
|---|---|---|---|---|---|---|
| 0 | Jehu Rudeforth | UK | Mint Chip Choco | 2022-01-04 | 5320 | 180 |
| 1 | Van Tuxwell | India | 85% Dark Bars | 2022-08-01 | 7896 | 94 |
| 2 | Gigi Bohling | India | Peanut Butter Cubes | 2022-07-07 | 4501 | 91 |
| 3 | Jan Morforth | Australia | Peanut Butter Cubes | 2022-04-27 | 12726 | 342 |
| 4 | Jehu Rudeforth | UK | Peanut Butter Cubes | 2022-02-24 | 13685 | 184 |
| ... | ... | ... | ... | ... | ... | ... |
| 1089 | Karlen McCaffrey | Australia | Spicy Special Slims | 2022-05-17 | 4410 | 323 |
| 1090 | Jehu Rudeforth | USA | White Choc | 2022-06-07 | 6559 | 119 |
| 1091 | Ches Bonnell | Canada | Organic Choco Syrup | 2022-07-26 | 574 | 217 |
| 1092 | Dotty Strutley | India | Eclairs | 2022-07-28 | 2086 | 384 |
| 1093 | Karlen McCaffrey | India | 70% Dark Bites | 2022-05-23 | 5075 | 344 |
1094 rows × 6 columns
def colValidate(df,col):
print(f"column : {col}")
print()
print(f"Number of unique values in column: {df[col].nunique()}")
print()
print("Unique Values:")
if df[col].nunique() >=100:
for i in range(0,df[col].nunique(),100):
print(df[col].unique()[i:i+100])
print()
else:
print(df[col].unique())
print()
print(f"Data type of column:{df[col].dtype}")
print()
colValidate(df,'Sales Person')
column : Sales Person Number of unique values in column: 25 Unique Values: ['Jehu Rudeforth' 'Van Tuxwell' 'Gigi Bohling' 'Jan Morforth' 'Oby Sorrel' 'Gunar Cockshoot' 'Brien Boise' 'Rafaelita Blaksland' 'Barr Faughny' 'Mallorie Waber' 'Karlen McCaffrey' "Marney O'Breen" 'Beverie Moffet' 'Roddy Speechley' 'Curtice Advani' 'Husein Augar' 'Kaine Padly' 'Dennison Crosswaite' "Wilone O'Kielt" 'Andria Kimpton' 'Kelci Walkden' 'Camilla Castle' 'Madelene Upcott' 'Dotty Strutley' 'Ches Bonnell'] Data type of column:object
colValidate(df,'Country')
column : Country Number of unique values in column: 6 Unique Values: ['UK' 'India' 'Australia' 'New Zealand' 'USA' 'Canada'] Data type of column:object
colValidate(df,'Product')
column : Product Number of unique values in column: 22 Unique Values: ['Mint Chip Choco' '85% Dark Bars' 'Peanut Butter Cubes' 'Smooth Sliky Salty' '99% Dark & Pure' 'After Nines' '50% Dark Bites' 'Orange Choco' 'Eclairs' 'Drinking Coco' 'Organic Choco Syrup' 'Milk Bars' 'Spicy Special Slims' 'Fruit & Nut Bars' 'White Choc' 'Manuka Honey Choco' 'Almond Choco' 'Raspberry Choco' 'Choco Coated Almonds' "Baker's Choco Chips" 'Caramel Stuffed Bars' '70% Dark Bites'] Data type of column:object
colValidate(df,'Amount in thousand ($)')
column : Amount in thousand ($) Number of unique values in column: 827 Unique Values: [ 5320 7896 4501 12726 13685 5376 3080 3990 2835 4704 3703 1442 168 8379 6790 4067 3017 8799 1085 6888 1267 4753 3003 7672 1652 4025 9492 5061 1722 12446 4284 6839 2163 9583 2653 147 3654 2443 938 14749 4781 6307 7602 9737 6979 4382 5243 4865 8575 91 14798 2205 441 3556 16793 15421 4438 1603 273 3073 6090 10255 2030 19453 9275 6181 9037 12313 5642 2800 959 2002 609 1274 7595 4725 9681 14504 280 63 8001 4032 5859 11095 7182 6881 7154 6188 4221 630 1743 2919 49 1827 13006 1064 11571 5740 1456 5334] [ 4151 8106 126 4697 7798 9408 1939 10906 5929 5579 10927 623 6013 1505 476 11550 17318 1848 7273 854 6832 4606 4466 16114 1561 8897 2464 4830 2765 7126 4263 3605 19929 5103 2541 5663 392 10976 9282 8267 4116 2093 1015 12516 2758 3297 6048 10101 2282 4361 3052 8029 5460 8911 7203 13083 2779 9058 3549 9436 10283 5446 1043 12586 1687 5299 3213 5194 13706 6489 9324 10829 8113 3269 7287 2583 9982 4795 9310 497 581 3472 4333 3325 11718 2100 4018 6468 7238 6454 3115 6475 1162 14238 1428 8064 9660 7357 6055 5124] [ 3479 784 18991 1372 735 6538 3199 4571 5481 3136 252 3395 14938 4053 5565 7161 8883 1351 3171 7910 3108 5075 378 7350 3388 11837 5425 3752 7728 2296 4403 3192 448 4270 70 6762 3745 2639 4389 2604 16569 14658 161 2807 13034 8484 2240 13447 9422 8687 2415 6272 3122 8043 11662 8925 7294 1526 2688 14889 2912 2632 6328 2527 1414 7714 4347 7490 2058 3640 455 10990 5558 1876 5523 11200 5593 7882 10241 1379 3010 1540 6916 1232 602 10185 2170 8673 4760 3374 420 2863 5936 889 16016 10479 1575 8197 840 5691] [ 4193 9016 9870 2576 3178 4676 5502 7462 5803 13888 6867 2317 1218 4109 1953 6797 9226 5733 4312 1638 10815 987 4669 2905 6986 1288 2345 3619 16982 8092 6993 637 6034 980 2821 3563 2996 2436 5509 12992 3724 7133 8617 9198 11823 5775 13125 14287 16233 5313 3577 3528 679 2450 10577 2597 2219 11319 5978 5327 6020 5614 1736 6384 14539 3493 994 1554 966 4935 10024 2506 6524 8148 3948 5271 12327 6167 18340 7014 7119 15491 5747 4550 2191 7623 9023 3402 10507 7721 5033 1960 238 7756 2660 11564 1365 4186 7406 112] [ 8204 2611 15652 4074 12250 2366 6965 5292 588 4046 7042 6713 6440 10885 2387 6678 4515 5852 1750 5782 2870 3094 12761 3696 5222 8939 2156 2380 3339 14980 1512 6657 3836 8771 651 6706 1421 8526 1435 7434 15316 2751 2786 2303 12271 11298 15855 12404 10808 4858 7742 10983 6769 777 3843 6930 8393 3822 6342 6510 10171 5908 10164 9716 22050 9989 4739 3185 8225 14301 1316 3486 13930 8470 77 3381 4102 2975 4137 9541 5152 11116 13076 10213 2485 8715 7 8904 4396 12068 9772 10458 6426 7504 5439 15547 11956 2723 19327 8848] [14336 16401 5173 2534 910 2331 2086 5012 13482 5621 10486 17626 4494 105 3220 4802 5677 945 8757 4816 7532 4340 1799 1127 10038 2926 6279 308 3500 658 12565 3955 8512 504 4256 13573 6566 13503 8680 385 2961 1981 7959 10794 1897 2744 5474 6069 6944 12173 5705 2492 3024 3437 1869 1225 1155 6811 6433 8169 2275 3857 1463 7924 2898 9506 7175 1729 1589 5187 6223 9457 2107 1862 6972 10220 3969 1547 10633 15057 9338 14525 5810 1582 791 9100 9884 3780 4557 5796 84 4746 6237 7483 1309 2989 9625 10031 6587 3311] [15330 14028 8624 1197 483 9534 1694 2933 2044 12656 8722 15750 19481 8099 13727 8659 8463 4914 3087 9205 1358 8498 700 644 826 203 13356 5894 1617 10822 10724 2016 7588 1904 4844 4599 9744 1400 364 7231 8309 3164 490 18032 5131 1141 7980 5229 3423 4326 3997 5180 952 10969 5964 1302 4179 1946 4291 7196 10437 3647 8995 7252 16380 4522 329 3458 4641 4627 868 210 7007 5845 8428 4977 9527 10766 4879 2408 8981 5250 7560 3038 1099 7413 5397 10325 2149 12362 12635 7315 1393 8022 371 16702 13258 10192 7140 9835] [ 749 7770 3507 5600 721 6615 7420 9114 5404 14924 7091 6496 3738 1190 9268 1029 12026 13405 14763 12894 8302 8183 1645 3143 5873 12425 1715 8190 4991 5096 7063 4200 11137 301 17465 10143 3626 3346 6321 9345 15008 7609 10332 819 5754 28 4956 2352 3906 1813 11788 5768 2401 10122 2268 574 1792 9107 12187 5670 2681 1533 9765 3318 4214 4690 13062 12488 14147 10199 11389 3584 12481 8603 6559 18697 10150 4592 4158 5684 693 546 1036 13846 2499 10689 2226 10262 3864 4900 8155 2478 2471 1666 8400 6139 7077 9954 1260 469] [ 973 2520 10500 7952 4872 6594 1470 2674 7364 6818 6776 15099 3612 3710 6111 2702 5887 10647 8589 11781 672 12558 8337 8134 8491 3066 4410] Data type of column:int32
df[df.duplicated()]
| Sales Person | Country | Product | Date | Amount in thousand ($) | Boxes Shipped |
|---|
data = df.copy()
data.head(2)
| Sales Person | Country | Product | Date | Amount in thousand ($) | Boxes Shipped | |
|---|---|---|---|---|---|---|
| 0 | Jehu Rudeforth | UK | Mint Chip Choco | 2022-01-04 | 5320 | 180 |
| 1 | Van Tuxwell | India | 85% Dark Bars | 2022-08-01 | 7896 | 94 |
| column | type |
|---|---|
| Sales Person | Categorical Nominal |
| Country Product | Categorical Nominal |
| product | Categorical Nominal |
| Date | Date and time |
| Amount in thousand ($) | Numerical Continuous |
| Boxes Shipped | Numerical Discrete |
2.2 Analysis
2.2.1 Uni-Variate Analysis - Study of Individual Column data
All visualizations in this project are created using Plotly, a powerful Python library for interactive charts and dashboards. Since Plotly generates dynamic, web-based visualizations, the charts and graphs can only be viewed when the code is executed (i.e., not as static images in a file).
data.head(2)
| Sales Person | Country | Product | Date | Amount in thousand ($) | Boxes Shipped | |
|---|---|---|---|---|---|---|
| 0 | Jehu Rudeforth | UK | Mint Chip Choco | 2022-01-04 | 5320 | 180 |
| 1 | Van Tuxwell | India | 85% Dark Bars | 2022-08-01 | 7896 | 94 |
Taking Uni-Variate Descriptive Stats User-Defined Functions
pd.set_option('display.float_format', lambda x: '%.2f' % x)
from simple_colors import *
############################ Numeric Continuous ############################
def ncstudy(df, col):
print(green("#######################################################",['bold']))
print(green("Taken Numeric Continuous Column:",['bold']), black(col,['bold']))
print(green("#######################################################",['bold']))
print()
print(cyan("Descriptive Stats:",['bold']))
print()
print(blue("******** Measures of Central Tendancy ************", ['bold']))
print(magenta("Mean:",['bold']), round(df[col].mean(),2))
print(magenta("Median:",['bold']), df[col].median())
print(magenta("Mode:",['bold']), df[col].mode()[0]) # Taking first value
print()
print(blue("******** Measures of Dispersion ************",['bold']))
print(magenta("Range:",['bold']), df[col].max()-df[col].min())
print(magenta("Variance:",['bold']), round(df[col].var(),2))
print(magenta("Standard Deviation:",['bold']), round(df[col].std(),2))
print(magenta("Five Number Summary:",['bold']))
print(round(data[col].describe(),2)[['min','25%','50%','75%','max']])
print()
print(blue("******** Measures of Symmetry ************",['bold']))
print(magenta("Skewness:",['bold']), round(df[col].skew(),2))
print(magenta("Kurtosis:",['bold']), round(df[col].kurt(),2))
print()
print(cyan("Visualization:",['bold']))
print()
px.box(df[col], orientation='h', width=650, height=300).show()
print()
############################## Numeric Discrete #################################
def ndstudy(df, col):
print(green("#######################################################",['bold']))
print(green("Taken Numeric Discrete Column:",['bold']), black(col,['bold']))
print(green("#######################################################",['bold']))
print()
print(cyan("Uni-Variate Descriptive Stats:",['bold']))
print()
print("******** Measures of Central Tendancy ************")
print(magenta("Mean:",['bold']), round(df[col].mean()))
print(magenta("Median:",['bold']), round(df[col].median()))
print(magenta("Mode:",['bold']), df[col].mode()[0]) # Taking first value
print()
print("******** Measures of Dispersion ************")
print(magenta("Range:",['bold']), df[col].max()-df[col].min())
print(magenta("Variance:",['bold']), round(df[col].var()))
print(magenta("Standard Deviation:",['bold']), round(df[col].std()))
print(magenta("Five Number Summary:",['bold']))
print(round(data[col].describe())[['min','25%','50%','75%','max']])
print()
print("******** Measures of Symmetry ************")
print(magenta("Skewness:",['bold']), round(df[col].skew(),2))
print(magenta("Kurtosis:",['bold']), round(df[col].kurt(),2))
print()
print(cyan("Visualization:",['bold']))
print()
px.box(df[col], orientation='h', width=650, height=300).show()
print()
############################# Categorical #######################################
def catstudy(df, col):
print(green("#######################################################",['bold']))
print(green("Taken Categorical Column:",['bold']), black(col,['bold']))
print(green("#######################################################",['bold']))
print()
print(cyan("Uni-Variate Descriptive Stats:",['bold']))
print()
print(magenta("Number of Categories/Classes in column:",['bold']), df[col].nunique())
print(magenta("Category Names:",['bold']))
print(df[col].unique())
print()
print(magenta("Value Counts (FD) of each Category:",['bold']))
print(df[col].value_counts())
print()
print(magenta("Value Counts of Each Class (FD) as Percentage:",['bold']))
print(round((df[col].value_counts()/len(df))*100,2))
print()
print(magenta("Mode:",['bold']), df[col].mode()[0])
print()
print(cyan("Visualization:",['bold']))
print()
print(black("Top Catgeories:", ['bold']))
# Considering only top 10 categories for pie chart
index = df[col].value_counts().sort_values(ascending=False)[0:10].index
vals = df[col].value_counts().sort_values(ascending=False)[0:10].values
fig = px.pie(names=index, values=vals, width=700, height=400)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(showlegend=False)
fig.show()
print()
######################################## DateTime ######################################
def datestudy(df, col):
print(green("#######################################################",['bold']))
print(green("Taken Date Column:",['bold']), black(col,['bold']))
print(green("#######################################################",['bold']))
print()
print(cyan("Uni-Variate Descriptive Stats:",['bold']))
print()
print(magenta("Start Date:",['bold']), df[col].min())
print(magenta("End Date:",['bold']), df[col].max())
print(magenta("Total Time Period (in Years):",['bold']), (df[col].max()-df[col].min()))
print()
print(cyan("Visualization:",['bold']))
print()
index = df[col].value_counts().index
vals = df[col].value_counts().values
px.scatter(x = index, y = vals, width=500, height=400).show()
print()
for col in data.columns:
if 'Year' in col:
datestudy(data, col)
elif data[col].dtype == object:
catstudy(data, col)
elif data[col].dtype == 'float64':
ncstudy(data, col)
elif data[col].dtype == 'int64':
ndstudy(data, col)
####################################################### Taken Categorical Column: Sales Person ####################################################### Uni-Variate Descriptive Stats: Number of Categories/Classes in column: 25 Category Names: ['Jehu Rudeforth' 'Van Tuxwell' 'Gigi Bohling' 'Jan Morforth' 'Oby Sorrel' 'Gunar Cockshoot' 'Brien Boise' 'Rafaelita Blaksland' 'Barr Faughny' 'Mallorie Waber' 'Karlen McCaffrey' "Marney O'Breen" 'Beverie Moffet' 'Roddy Speechley' 'Curtice Advani' 'Husein Augar' 'Kaine Padly' 'Dennison Crosswaite' "Wilone O'Kielt" 'Andria Kimpton' 'Kelci Walkden' 'Camilla Castle' 'Madelene Upcott' 'Dotty Strutley' 'Ches Bonnell'] Value Counts (FD) of each Category: Kelci Walkden 54 Brien Boise 53 Van Tuxwell 51 Beverie Moffet 50 Dennison Crosswaite 49 Oby Sorrel 49 Ches Bonnell 48 Karlen McCaffrey 47 Gigi Bohling 47 Curtice Advani 46 Kaine Padly 45 Madelene Upcott 45 Marney O'Breen 45 Barr Faughny 43 Jehu Rudeforth 43 Roddy Speechley 43 Gunar Cockshoot 43 Mallorie Waber 41 Jan Morforth 39 Andria Kimpton 39 Husein Augar 38 Dotty Strutley 36 Wilone O'Kielt 34 Rafaelita Blaksland 34 Camilla Castle 32 Name: Sales Person, dtype: int64 Value Counts of Each Class (FD) as Percentage: Kelci Walkden 4.94 Brien Boise 4.84 Van Tuxwell 4.66 Beverie Moffet 4.57 Dennison Crosswaite 4.48 Oby Sorrel 4.48 Ches Bonnell 4.39 Karlen McCaffrey 4.30 Gigi Bohling 4.30 Curtice Advani 4.20 Kaine Padly 4.11 Madelene Upcott 4.11 Marney O'Breen 4.11 Barr Faughny 3.93 Jehu Rudeforth 3.93 Roddy Speechley 3.93 Gunar Cockshoot 3.93 Mallorie Waber 3.75 Jan Morforth 3.56 Andria Kimpton 3.56 Husein Augar 3.47 Dotty Strutley 3.29 Wilone O'Kielt 3.11 Rafaelita Blaksland 3.11 Camilla Castle 2.93 Name: Sales Person, dtype: float64 Mode: Kelci Walkden Visualization: Top Catgeories:
####################################################### Taken Categorical Column: Country ####################################################### Uni-Variate Descriptive Stats: Number of Categories/Classes in column: 6 Category Names: ['UK' 'India' 'Australia' 'New Zealand' 'USA' 'Canada'] Value Counts (FD) of each Category: Australia 205 India 184 USA 179 UK 178 Canada 175 New Zealand 173 Name: Country, dtype: int64 Value Counts of Each Class (FD) as Percentage: Australia 18.74 India 16.82 USA 16.36 UK 16.27 Canada 16.00 New Zealand 15.81 Name: Country, dtype: float64 Mode: Australia Visualization: Top Catgeories:
####################################################### Taken Categorical Column: Product ####################################################### Uni-Variate Descriptive Stats: Number of Categories/Classes in column: 22 Category Names: ['Mint Chip Choco' '85% Dark Bars' 'Peanut Butter Cubes' 'Smooth Sliky Salty' '99% Dark & Pure' 'After Nines' '50% Dark Bites' 'Orange Choco' 'Eclairs' 'Drinking Coco' 'Organic Choco Syrup' 'Milk Bars' 'Spicy Special Slims' 'Fruit & Nut Bars' 'White Choc' 'Manuka Honey Choco' 'Almond Choco' 'Raspberry Choco' 'Choco Coated Almonds' "Baker's Choco Chips" 'Caramel Stuffed Bars' '70% Dark Bites'] Value Counts (FD) of each Category: 50% Dark Bites 60 Eclairs 60 Smooth Sliky Salty 59 White Choc 58 Drinking Coco 56 Spicy Special Slims 54 Organic Choco Syrup 52 After Nines 50 85% Dark Bars 50 Fruit & Nut Bars 50 Milk Bars 49 Peanut Butter Cubes 49 99% Dark & Pure 49 Almond Choco 48 Raspberry Choco 48 Orange Choco 47 Mint Chip Choco 45 Manuka Honey Choco 45 Caramel Stuffed Bars 43 70% Dark Bites 42 Baker's Choco Chips 41 Choco Coated Almonds 39 Name: Product, dtype: int64 Value Counts of Each Class (FD) as Percentage: 50% Dark Bites 5.48 Eclairs 5.48 Smooth Sliky Salty 5.39 White Choc 5.30 Drinking Coco 5.12 Spicy Special Slims 4.94 Organic Choco Syrup 4.75 After Nines 4.57 85% Dark Bars 4.57 Fruit & Nut Bars 4.57 Milk Bars 4.48 Peanut Butter Cubes 4.48 99% Dark & Pure 4.48 Almond Choco 4.39 Raspberry Choco 4.39 Orange Choco 4.30 Mint Chip Choco 4.11 Manuka Honey Choco 4.11 Caramel Stuffed Bars 3.93 70% Dark Bites 3.84 Baker's Choco Chips 3.75 Choco Coated Almonds 3.56 Name: Product, dtype: float64 Mode: 50% Dark Bites Visualization: Top Catgeories:
####################################################### Taken Numeric Discrete Column: Boxes Shipped ####################################################### Uni-Variate Descriptive Stats: ******** Measures of Central Tendancy ************ Mean: 162 Median: 135 Mode: 24 ******** Measures of Dispersion ************ Range: 708 Variance: 14773 Standard Deviation: 122 Five Number Summary: min 1.00 25% 70.00 50% 135.00 75% 229.00 max 709.00 Name: Boxes Shipped, dtype: float64 ******** Measures of Symmetry ************ Skewness: 1.11 Kurtosis: 1.16 Visualization:
Insights
Kelci Walkden is the top-performing salesperson, selling 54 unique product categories, showcasing a wide reach across customer preferences.
Brien Boise, Van Tuxwell, and Beverie Moffet follow closely with 53, 51, and 50 product categories respectively, indicating consistently high performance.
Barr Faughny, Jehu Rudeforth, Roddy Speechley, and Gunar Cockshoot — sold exactly 43 product categories, highlighting a tie in mid-tier performance.
The lowest-performing salesperson is Camilla Castle, with sales across only 32 product categories
Australia emerges as the top market with the highest number of chocolate box purchases (205 boxes), indicating strong regional demand.
India(184) is also in significant market but fall behind Australia in total box shipments.
The most purchased products are 50% Dark Bites and Eclairs, each with 60 units sold, suggesting these are customer favorites.
The least purchased product is Choco Coated Almonds, with only 39 units shipped.
2.2.2 Bi/Multi-Variate Analysis - Study of Data between two or more columns
In the Bi/Multi-Variate we have these variable combinations
data.head(2)
| Sales Person | Country | Product | Date | Amount in thousand ($) | Boxes Shipped | |
|---|---|---|---|---|---|---|
| 0 | Jehu Rudeforth | UK | Mint Chip Choco | 2022-01-04 | 5320 | 180 |
| 1 | Van Tuxwell | India | 85% Dark Bars | 2022-08-01 | 7896 | 94 |
Selecting specific columns combos from above based on business talks , to study data
| Pure Numeric | Mixed Columns | Pure Categorical |
|---|---|---|
| Amount vs Boxes Shipped | Sales Person vs Boxes Shipped | Country vs Product |
| Month vs Boxes Shipped | Country vs Box Shipped | Sales Person vs Country |
| Amount vs Month | Product vs Amount | Sales Person vs Product |
| '' | Product vs Box shipped | '' |
Pure Numeric
Descriptive Stats
Amount vs Boxes Shipped
print("Amount vs Boxes Shipped:")
display(round(data[['Boxes Shipped','Amount in thousand ($)']].corr(),2))
Amount vs Boxes Shipped:
| Boxes Shipped | Amount in thousand ($) | |
|---|---|---|
| Boxes Shipped | 1.00 | -0.02 |
| Amount in thousand ($) | -0.02 | 1.00 |
px.scatter(data, x='Boxes Shipped', y='Amount in thousand ($)').show()
The correlation between "Amount in thousand ($)" and "Boxes Shipped" is -0.02, which is very close to 0.
This means there is almost no linear relationship between the total sales amount and the number of boxes shipped.
Month vs Boxes Shipped
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")
df["Month"] = df["Date"].dt.strftime('%b')
df["Month_num"] = df["Date"].dt.month
monthly_boxes = df.groupby(["Month_num", "Month"])["Boxes Shipped"].count().reset_index()
monthly_boxes = monthly_boxes.sort_values("Month_num")
fig = px.bar(monthly_boxes, x="Month", y="Boxes Shipped", title="Monthly Chocolate Box Shipments", text_auto=True)
fig.update_layout(xaxis_title="Month", yaxis_title="Total Boxes Shipped")
fig.show()
Insights
Amount vs Month
monthly_amount = df.groupby(["Month_num", "Month"])["Amount in thousand ($)"].sum().reset_index()
monthly_amount = monthly_amount.sort_values("Month_num")
fig = px.line(monthly_amount, x="Month", y="Amount in thousand ($)", title="Monthly Chocolate Sales",markers=True,text="Amount in thousand ($)")
fig.update_layout(xaxis_title="Month", yaxis_title="Amount (in Thousand $)")
fig.update_traces(marker=dict(size=9, color='blue', symbol='circle'),texttemplate='%{text:.0s}',textposition="top center")
fig.show()
Insights
Mixed Columns
Sales Person vs Boxes Shipped
mydata=df.groupby("Sales Person")["Boxes Shipped"].sum().reset_index().sort_values("Boxes Shipped",ascending=False)
fig = px.bar(mydata,x="Sales Person", y="Boxes Shipped", title="Sales Person vs Boxes Shipped",text_auto='.2s')
fig.update_traces(textfont_size=9)
fig.show()
Insights
Country vs Boxes Shipped
mydata=df.groupby("Country")["Boxes Shipped"].count().reset_index()
fig = px.bar(mydata,x="Country", y="Boxes Shipped", title="Country vs Boxes Shipped",text_auto='.2s')
fig.show()
Insights
Product vs Amount
mydata=df.groupby("Product")["Amount in thousand ($)"].sum().reset_index().sort_values("Amount in thousand ($)",ascending=False)
fig = px.bar(mydata,x="Product", y="Amount in thousand ($)", title="Product vs Amount",text_auto='.2s')
fig.update_layout(xaxis_tickangle=-45)
fig.update_traces(textfont_size=9)
fig.show()
Product vs Boxes Shipped
Insights
Smooth Silky Salty is the most purchased product, with a total sales amount of $349.69K (almost $350K).
70% Dark Bites recorded the lowest total amount, with $211.61K in sales.
mydata=df.groupby("Product")["Boxes Shipped"].sum().reset_index().sort_values("Boxes Shipped", ascending=False)
fig = px.bar(mydata,x="Product", y="Boxes Shipped", title="Product vs Boxes Shipped",text_auto='.2s')
fig.update_layout(xaxis_tickangle=-45)
fig.update_traces(textfont_size=10)
fig.show()
Insights
50% Dark Bites had the highest number of boxes shipped, totaling 9.792K.
Choco Coated Almonds had the lowest number of boxes shipped, with 6.464K.
Country vs Amount
df['Amount in thousand ($)'] = df['Amount in thousand ($)'].astype(float)
country_sales = df.groupby('Country')['Amount in thousand ($)'].sum().reset_index()
fig = px.pie(country_sales,names='Country',values='Amount in thousand ($)',title='Total Chocolate Sales by Country',hole=0.3)
fig.show()
Insights
Australia has the highest total chocolate sales among all countries, contributing the largest share in revenue.
India, USA and the UK follow next, but with comparatively lower sales amounts.
Categorical
Country vs Product
fig = px.histogram(df, x="Country", color="Product", barmode="group",title="Country vs Product Distribution")
fig.show()
Insights
Multi-Variate Analysis
Pure Numeric
data.head()
| Sales Person | Country | Product | Date | Amount in thousand ($) | Boxes Shipped | |
|---|---|---|---|---|---|---|
| 0 | Jehu Rudeforth | UK | Mint Chip Choco | 2022-01-04 | 5320 | 180 |
| 1 | Van Tuxwell | India | 85% Dark Bars | 2022-08-01 | 7896 | 94 |
| 2 | Gigi Bohling | India | Peanut Butter Cubes | 2022-07-07 | 4501 | 91 |
| 3 | Jan Morforth | Australia | Peanut Butter Cubes | 2022-04-27 | 12726 | 342 |
| 4 | Jehu Rudeforth | UK | Peanut Butter Cubes | 2022-02-24 | 13685 | 184 |
data.corr(numeric_only=True)
| Amount in thousand ($) | Boxes Shipped | |
|---|---|---|
| Amount in thousand ($) | 1.00 | -0.02 |
| Boxes Shipped | -0.02 | 1.00 |
lower = np.triu(data.corr(numeric_only=True))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='viridis', mask=lower)
<Axes: >
Insights
The correlation between "Amount in thousand ($)" and "Boxes Shipped" is -0.02, which is very close to 0.
This means there is almost no linear relationship between the total sales amount and the number of boxes shipped.
In other words, shipping more boxes does not necessarily lead to higher revenue, possibly because:
Some products are low cost but shipped in large quantities.
sns.pairplot(data)
<seaborn.axisgrid.PairGrid at 0x22110eaea50>
Pure Categorical
pd.crosstab(data['Country'], data['Product'])
| Product | 50% Dark Bites | 70% Dark Bites | 85% Dark Bars | 99% Dark & Pure | After Nines | Almond Choco | Baker's Choco Chips | Caramel Stuffed Bars | Choco Coated Almonds | Drinking Coco | ... | Manuka Honey Choco | Milk Bars | Mint Chip Choco | Orange Choco | Organic Choco Syrup | Peanut Butter Cubes | Raspberry Choco | Smooth Sliky Salty | Spicy Special Slims | White Choc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Country | |||||||||||||||||||||
| Australia | 16 | 9 | 8 | 11 | 7 | 11 | 9 | 7 | 6 | 12 | ... | 8 | 11 | 7 | 8 | 12 | 6 | 11 | 8 | 8 | 11 |
| Canada | 6 | 8 | 6 | 6 | 7 | 8 | 5 | 6 | 10 | 9 | ... | 8 | 7 | 4 | 8 | 10 | 10 | 6 | 12 | 11 | 9 |
| India | 8 | 7 | 9 | 8 | 10 | 9 | 5 | 9 | 7 | 8 | ... | 4 | 6 | 10 | 7 | 8 | 9 | 7 | 11 | 14 | 7 |
| New Zealand | 7 | 6 | 12 | 5 | 12 | 5 | 8 | 6 | 3 | 6 | ... | 8 | 6 | 14 | 7 | 10 | 8 | 7 | 9 | 8 | 9 |
| UK | 12 | 6 | 6 | 11 | 5 | 8 | 9 | 9 | 9 | 12 | ... | 8 | 9 | 6 | 6 | 6 | 9 | 6 | 11 | 5 | 12 |
| USA | 11 | 6 | 9 | 8 | 9 | 7 | 5 | 6 | 4 | 9 | ... | 9 | 10 | 4 | 11 | 6 | 7 | 11 | 8 | 8 | 10 |
6 rows × 22 columns
px.sunburst(df, path=["Country", "Product"], values="Boxes Shipped")
Single Column Study:
Kelci Walkden is the top-performing salesperson, selling 54 unique product categories, showcasing a wide reach across customer preferences.
Brien Boise, Van Tuxwell, and Beverie Moffet follow closely with 53, 51, and 50 product categories respectively, indicating consistently high performance.
Barr Faughny, Jehu Rudeforth, Roddy Speechley, and Gunar Cockshoot — sold exactly 43 product categories, highlighting a tie in mid-tier performance.
The lowest-performing salesperson is Camilla Castle, with sales across only 32 product categories
Australia emerges as the top market with the highest number of chocolate box purchases (205 boxes), indicating strong regional demand.
India(184) is also in significant market but fall behind Australia in total box shipments.
The most purchased products are 50% Dark Bites and Eclairs, each with 60 units sold, suggesting these are customer favorites.
The least purchased product is Choco Coated Almonds, with only 39 units shipped.
Study of Data between two or more columns
June recorded the highest number of boxes shipped, totaling 163. And February had the lowest number of boxes shipped, with only 110.
January had the highest sales, with chocolates purchased worth approximately $896.1K (almost $900K). And April saw the lowest sales, with an amount of $674.05K.
50% Dark Bites had the highest number of boxes shipped, totaling 9.792K.
Choco Coated Almonds had the lowest number of boxes shipped, with 6.464K.
Australia has the highest total chocolate sales among all countries, contributing the largest share in revenue.
India, USA and the UK follow next, but with comparatively lower sales amounts.